{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `plot()`: analyze distributions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function `plot()` explores the distributions and statistics of the dataset. It generates a variety of visualizations and statistics which enables the user to achieve a comprehensive understanding of the column distributions and their relationships. The following describes the functionality of `plot()` for a given dataframe `df`.\n",
    "\n",
    "1. `plot(df)`: plots the distribution of each column and computes dataset statistics\n",
    "2. `plot(df, col1)`: plots the distribution of column `col1` in various ways, and computes its statistics\n",
    "3. `plot(df, col1, col2)`: generates plots depicting the relationship between columns `col1` and `col2`\n",
    "\n",
    "The generated plots are different for numerical, categorical and geography columns. The following table summarizes the output for the different column types.\n",
    "\n",
    "| `col1` | `col2` | Output |\n",
    "| --- | --- | --- |\n",
    "| None | None | dataset statistics, [histogram](https://www.wikiwand.com/en/Histogram) or [bar chart](https://www.wikiwand.com/en/Bar_chart) for each column |\n",
    "| Numerical | None | column statistics, histogram, [kde plot](https://www.wikiwand.com/en/Kernel_density_estimation), [qq-normal plot](https://www.wikiwand.com/en/Q%E2%80%93Q_plot), [box plot](https://www.wikiwand.com/en/Box_plot) |\n",
    "| Categorical | None | column statistics, bar chart, [pie chart](https://www.wikiwand.com/en/Pie_chart), [word cloud](https://www.wikiwand.com/en/Tag_cloud), word frequencies |\n",
    "| Geography | None | column statistics, bar chart, [pie chart](https://www.wikiwand.com/en/Pie_chart), [word cloud](https://www.wikiwand.com/en/Tag_cloud), word frequencies, world map |\n",
    "| Numerical | Numerical | [scatter plot](https://www.wikiwand.com/en/Scatter_plot), [hexbin plot](https://www.data-to-viz.com/graph/hexbinmap.html), binned box plot|\n",
    "| Numerical | Categorical | categorical box plot, multi-[line chart](https://www.wikiwand.com/en/Line_chart) |\n",
    "| Categorical | Numerical | categorical box plot, multi-line chart\n",
    "| Categorical | Categorical | [nested bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [stacked bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [heat map](https://www.wikiwand.com/en/Heat_map) |\n",
    "| Categorical | Geography | [nested bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [stacked bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [heat map](https://www.wikiwand.com/en/Heat_map) |\n",
    "| Geography | Categorical | [nested bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [stacked bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [heat map](https://www.wikiwand.com/en/Heat_map) |\n",
    "| Geopoint | Categorical | [nested bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [stacked bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [heat map](https://www.wikiwand.com/en/Heat_map) |\n",
    "| Categorical | Geopoint | [nested bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [stacked bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [heat map](https://www.wikiwand.com/en/Heat_map) |\n",
    "| Numerical | Geography | categorical box plot, multi-[line chart](https://www.wikiwand.com/en/Line_chart), world map |\n",
    "| Geography | Numerical | categorical box plot, multi-[line chart](https://www.wikiwand.com/en/Line_chart), world map |\n",
    "| Numerical | Geopoint | geo map|\n",
    "| Geopoint | Numerical | geo map|\n",
    "\n",
    "Next, we demonstrate the functionality of `plot()`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the dataset\n",
    "`dataprep.eda` supports **Pandas** and **Dask** dataframes. Here, we will load the well-known [adult dataset](http://archive.ics.uci.edu/ml/datasets/Adult) into a Pandas dataframe using the load_dataset function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.datasets import load_dataset\n",
    "import numpy as np\n",
    "df = load_dataset('adult')\n",
    "df = df.replace(\" ?\", np.NaN)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get an overview of the dataset with `plot(df)`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start by calling `plot(df)` which computes dataset-level statistics, a histogram for each numerical column, and a bar chart for each categorical column. The number of bins in the histogram can be specified with the parameter `bins`, and the number of categories in the bar chart can be specified with the parameter `ngroups`. If a column contains missing values, the percent of missing values is shown in the title and ignored when generating the plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-12T01:42:21.974553Z",
     "start_time": "2020-11-12T01:42:06.719623Z"
    }
   },
   "outputs": [],
   "source": [
    "from dataprep.eda import plot\n",
    "plot(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understand a column with `plot(df, col1)`\n",
    "\n",
    "After getting an overview of the dataset, we can thoroughly investigate a column of interest `col1` using `plot(df, col1)`. The output is of `plot(df, col1)` is different for numerical and categorical columns.\n",
    "\n",
    "When `col1` is a numerical column, it  computes column statistics, and generates a histogram, kde plot, box plot and qq-normal plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-12T01:42:23.072781Z",
     "start_time": "2020-11-12T01:42:21.979715Z"
    }
   },
   "outputs": [],
   "source": [
    "plot(df, \"age\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When `x` is a categorical column, it computes column statistics, and plots a bar chart, pie chart, word cloud, word frequency and word length:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-12T01:42:28.435038Z",
     "start_time": "2020-11-12T01:42:23.076615Z"
    }
   },
   "outputs": [],
   "source": [
    "plot(df, \"education\")"
   ]
  },
  {
   "source": [
    "When `x` is a Geography column, it computes column statistics, and plots a bar chart, pie chart, word cloud, word frequency, word length and world map:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_geo = load_dataset('countries')\n",
    "plot(df_geo, \"Country\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understand the relationship between two columns with `plot(df, col1, col2)`\n",
    "\n",
    "Next, we can explore the relationship between columns `col1` and `col2` using `plot(df, col1, col2)`. The output depends on the types of the columns. \n",
    "\n",
    "When `col1` and `col2` are both numerical columns, it generates a scatter plot, hexbin plot and box plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-12T01:42:31.475241Z",
     "start_time": "2020-11-12T01:42:28.441074Z"
    }
   },
   "outputs": [],
   "source": [
    "plot(df, \"age\", \"hours-per-week\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When `col1` and `col2` are both categorical columns, it plots a nested bar chart, stacked bar chart and heat map:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-12T01:42:35.560295Z",
     "start_time": "2020-11-12T01:42:31.479771Z"
    }
   },
   "outputs": [],
   "source": [
    "plot(df, \"education\", \"marital-status\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When `col1` and `col2` are one each of type numerical and categorical, it generates a box plot per category and a multi-line chart:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-12T01:42:40.093930Z",
     "start_time": "2020-11-12T01:42:35.568210Z"
    }
   },
   "outputs": [],
   "source": [
    "plot(df, \"age\", \"education\")\n",
    "# or plot(df, \"education\", \"age\")"
   ]
  },
  {
   "source": [
    "When `col1` and `col2` are one each of type geopoint and categorical, or, geography and categorical, it generates a box plot per category and a multi-line chart:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.eda.dtypes_v2 import LatLong\n",
    "covid = load_dataset('covid19')\n",
    "latlong = LatLong(\"Lat\", \"Long\") # create geopoint type using \"LatLong\" function by inputing two columns names\n",
    "plot(covid, latlong, \"Country/Region\")\n",
    "# or plot(covid, \"Country/Region\", latlong)\n",
    "\n",
    "plot(df_geo,\"Country\", \"Region\")\n",
    "# or plot(df_geo, \"Region\", \"Country\")"
   ]
  },
  {
   "source": [
    "When `col1` and `col2` are one each of type geography and numerical, it generates a box plot per category, a multi-line chart and a world map:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(df_geo,\"Country\", \"Population\")\n",
    "# or plot(df_geo, \"Population\", \"Country\")"
   ]
  },
  {
   "source": [
    "When `col1` and `col2` are one each of type geopoint and numerical, it generates a geo map:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(covid, latlong, \"2/16/2020\")\n",
    "# or plot(covid, \"2/16/2020\", latlong)"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  },
  "metadata": {
   "interpreter": {
    "hash": "cbc395162ceef86bd30db76a3f31f271f845529665561e61c8ad64483b646fe3"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}